Group 22: Predicting Algerian forest fires

Introduction

We have all seen the rise in the number of forest fires in the past few years, and scientists and climate activists worldwide have raised their concerns. While fire is a natural part of some ecosystems, many of these recent forest fires trace their cause to the global rise in temperature and poor land management by authorities. As a result, whether a wildfire's origin is human or natural, the drier climate makes it easy for a fire to spread intensely across a region.
These fires are a clear call for change because they can result in irreparable damage to forest ecosystems. The gases released by these fires travel to nearby cities, causing health crises. While governments worldwide introduce initiatives to promote sustainable practices and the general public abides by them, it is also vital for scientists to know which forests are prone to fires. Through this project, we intend to answer the question: "Can we predict a wildfire in an Algerian forest based on the given data?"

Data set used:

Algerian forest fires: https://archive.ics.uci.edu/ml/machine-learning-databases/00547/Algerian_forest_fires_dataset_UPDATE.csv
This data set has 243 observations of fourteen variables, thirteen numerical and one categorical.

Loading the data set into R and wrangling it

In [1]:
# install.packages("skimr")
# install.packages("GGally")
# install.packages("tidyverse")
# install.packages("repr")
# install.packages("tidymodels")
# install.packages("kknn")
library(tidyverse)
library(repr)
library(GGally)
library(tidymodels)
library(skimr)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

 ggplot2 3.3.3      purrr   0.3.4
 tibble  3.1.0      dplyr   1.0.5
 tidyr   1.1.3      stringr 1.4.0
 readr   1.4.0      forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──

 broom     0.7.6       recipes   0.1.15
 dials     0.0.9       rsample   0.0.9 
 infer     0.5.4       tune      0.1.3 
 modeldata 0.1.0       workflows 0.2.2 
 parsnip   0.1.5       yardstick 0.0.8 

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
 scales::discard() masks purrr::discard()
 dplyr::filter()   masks stats::filter()
 recipes::fixed()  masks stringr::fixed()
 dplyr::lag()      masks stats::lag()
 yardstick::spec() masks readr::spec()
 recipes::step()   masks stats::step()

In [2]:
# URL that contains the dataframe that we want to examine
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00547/Algerian_forest_fires_dataset_UPDATE.csv"

# Reading dataframe into R
algerian_forest_fires <- read_csv(url,skip = 1)

# Examining the data frame
# algerian_forest_fires

# Inspecting the parsing failures

# algerian_forest_fires[123, 1:14] -- Empty row between two data sets
 
# algerian_forest_fires[168, 1:14] -- Needs to be fixed

algerian_forest_fires[168, 10] = '14.6'
algerian_forest_fires[168, 11] =  '9'
algerian_forest_fires[168, 12] =  '12.5'
algerian_forest_fires[168, 13] =  '10.4'
algerian_forest_fires[168, 14] =  "fire"
── Column specification ────────────────────────────────────────────────────────
cols(
  day = col_character(),
  month = col_character(),
  year = col_character(),
  Temperature = col_character(),
  RH = col_character(),
  Ws = col_character(),
  Rain = col_character(),
  FFMC = col_character(),
  DMC = col_character(),
  DC = col_character(),
  ISI = col_character(),
  BUI = col_character(),
  FWI = col_character(),
  Classes = col_character()
)


Warning message:
“2 parsing failures.
row col   expected     actual                                                                                                       file
123  -- 14 columns 1 columns  'https://archive.ics.uci.edu/ml/machine-learning-databases/00547/Algerian_forest_fires_dataset_UPDATE.csv'
168  -- 14 columns 13 columns 'https://archive.ics.uci.edu/ml/machine-learning-databases/00547/Algerian_forest_fires_dataset_UPDATE.csv'
”

Upon examination we infer that:

  • The file we read contains two data frames stacked one after another (one per region).
  • The columns' types were not parsed correctly (every column was read as character).
  • To extract the target data frame, we will split the file into two data frames, tidy both, and finally join them.

    Extracting and tidying Bejaja region data set

    In [3]:
    # Retrieving the bejaja data set
    bejaja_forest <- slice(algerian_forest_fires, 2:122)
    
    # Fixing the column types of the numerical variables
    bejaja_forest_data <- bejaja_forest %>% 
        select(day:FWI) %>% 
        map_df(as.double) 
    
    # Selecting the observation class data
    bejaja_forest_fires <- bejaja_forest %>% 
        select(Classes) 
    
    # Joining numerical variable and observation class data 
    bejaja_forest_tidy <- cbind(bejaja_forest_data,bejaja_forest_fires)
    #head(bejaja_forest_tidy)
    

    Extracting and tidying Sidi-Bel region data set

    In [4]:
    #Finding the number of rows in the original dataset
    number_of_rows <- nrow(algerian_forest_fires)
    
    # Retrieving the sidi-bel data set
    sidi_bel_forest <- slice(algerian_forest_fires, 125:number_of_rows)
    
    
    # Fixing the column types of the numerical variables
    sidi_bel_forest_data <- sidi_bel_forest %>% 
        select(day:FWI) %>% 
        map_df(as.double)
    
    
    # Selecting the observation class data
    sidi_bel_forest_fires <- sidi_bel_forest %>% 
        select(Classes)
    
    # Joining numerical variable and observation class data 
    sidi_bel_forest_tidy <- cbind(sidi_bel_forest_data,sidi_bel_forest_fires)
    

    Joining our two dataframes

    In [5]:
    algerian_forest_fires_tidy <- full_join(bejaja_forest_tidy, sidi_bel_forest_tidy)
    # algerian_forest_fires_tidy
    
    # The resulting data frame is tidy and ready for the further analysis
    
    Joining, by = c("day", "month", "year", "Temperature", "RH", "Ws", "Rain", "FFMC", "DMC", "DC", "ISI", "BUI", "FWI", "Classes")
    
    

    Exploratory data analysis

    Splitting our data

    Choosing the split proportion

  • When choosing a split proportion, we trade off a more accurate estimate of the model's performance (a larger test set) against a better-trained model (a larger training set).
  • We chose to split our data set into 75% training data and 25% testing data. We reasoned that since our data set is not large, allocating more observations to the testing set would leave too little training data and noticeably worsen our model's performance.
    In [6]:
    # Setting the seed 
    set.seed(2021)
    # Splitting the data set
    algerian_forest_fires_tidy <- mutate(algerian_forest_fires_tidy, Classes = as.factor(Classes))
    forest_split <- initial_split(algerian_forest_fires_tidy, prop = 0.75, strata = Classes)
    forest_train <- training(forest_split)
    forest_test <- testing(forest_split)
    

    Creating summary table

    In [7]:
    # Specifying the summary function
    my_skim <- skim_with(numeric = sfl(median, mean,  sd, min, max),
                                       append = FALSE)
    
    # Creating summary table
    summary_df <- my_skim(forest_train) %>% 
        tibble::as_tibble() %>% 
        select(skim_variable:numeric.max)
    summary_df
    
    A tibble: 14 × 11 (factor columns apply only to Classes)
    skim_variable  n_missing  complete_rate  median  mean          sd         min     max
    Classes        0          1              factor, 2 levels: fire 104, not fire 79
    day            0          1              16.0    16.2622951     8.723283  1.0     31.0
    month          0          1              8.0      7.5519126     1.122343  6.0      9.0
    year           0          1              2012.0  2012.0000000   0.000000  2012.0  2012.0
    Temperature    0          1              32.0    31.9125683     3.558461  22.0    42.0
    RH             0          1              64.0    62.4153005    14.480081  24.0    89.0
    Ws             0          1              15.0    15.4699454     2.751208  6.0     26.0
    Rain           0          1              0.0      0.7868852     1.965567  0.0     16.8
    FFMC           0          1              83.8    77.7551913    14.384399  28.6    96.0
    DMC            0          1              11.1    14.8459016    13.145946  0.9     65.9
    DC             0          1              32.2    49.5109290    48.925393  6.9     220.4
    ISI            0          1              3.5      4.7032787     4.044735  0.0     19.0
    BUI            0          1              11.8    16.7945355    14.968618  1.4     68.0
    FWI            0          1              4.7      7.0426230     7.511803  0.0     31.1

    Imbalance of class labels

    After a quick look at the summary table, we can see that there are more fire observations than non-fire observations.
    We believe this is not an issue, since the roughly 30% difference in counts does not warrant upsampling the data.
    On the contrary, upsampling the data in this case would probably result in worse performance, since each non-fire
    observation would gain more weight, making our model favour non-fire predictions.
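
    If the imbalance were larger, upsampling could be added directly to the preprocessing recipe. Below is a minimal sketch using the themis package on hypothetical toy data; it is not part of our analysis, and the data frame and column values are invented for illustration only.

    ```r
    library(tidymodels)
    library(themis)  # provides step_upsample()

    # Hypothetical toy data with a 2:1 class imbalance
    toy <- tibble(
        ISI = runif(30),
        DMC = runif(30),
        Classes = factor(c(rep("fire", 20), rep("not fire", 10)))
    )

    # step_upsample() resamples the minority class; with over_ratio = 1,
    # both classes end up with the same number of rows in the training data
    balanced <- recipe(Classes ~ ISI + DMC, data = toy) %>%
        step_upsample(Classes, over_ratio = 1) %>%
        prep() %>%
        bake(new_data = NULL)

    table(balanced$Classes)  # both classes now have 20 observations
    ```

    Note that step_upsample() is skipped when the recipe is applied to new (test) data, so only the training set is rebalanced.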

    Plotting our data

    In [8]:
    # Plot options
    options(repr.plot.width = 18, repr.plot.height= 18)
    # Removing the date data
    forest_train <- select(forest_train,Temperature:Classes)
    # We are using ggpairs to plot our variables against each other and see correlation
    ggpairs(forest_train, columns = c("Temperature","RH","Ws","Rain","FFMC","DMC","DC","ISI","BUI","FWI"),
          aes(colour = Classes))  +
          theme(text = element_text(size = 18),
           axis.text = element_text(size = 10))
    

    Methods

    • The question we seek to answer through this project falls under the classification category. Therefore, after tidying the dataset, we divided it into a training set and a testing set.
    • The categorical variable here is Classes, which tells us whether or not there was a fire. The predictors chosen for the model are the Duff Moisture Code (DMC) and the Initial Spread Index (ISI).
    • DMC represents the fuel moisture of decomposed organic material present on the forest floor. ISI estimates the spread potential by integrating the moisture content of the forest and the wind speed at the given time.
    • They are derived from variables such as temperature, rain, wind speed, and humidity calculated and predicted by scientists daily.
    • During our EDA process, we found that these two variables separate fire and non-fire observations well, as illustrated in this project with the help of ggpairs(). We also found that a model with these predictors gives the highest accuracy.
    • Now that we have defined the categorical variable and its predictors, we tune our K-nearest classifier model based on the training set to find the best K value.
    • We made a model specification, preprocessed our data using the recipe() function, and used cross-validation to guard against overfitting and to estimate our model's accuracy. A higher number of folds in cross-validation gives an accuracy estimate with a lower standard error.
    • After putting the entire process in a workflow, we determined the value of K by choosing the result with the highest cross-validation accuracy. The K value obtained was 5.
    • This model will then be used to predict our test set's class labels, with the model specification using the K value obtained during the training phase, and then determine its accuracy on new observations.
    • For our visualization, we will create a decision boundary plot.

    Tuning our model

    In [9]:
    # Setting the seed value
    set.seed(2021)
    
    # Creating a model for tuning 
    knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>% 
        set_engine("kknn") %>% 
        set_mode("classification")
    
    # Creating a data pre-proccessing recipe
    ff_recipe <- recipe(Classes ~  ISI + DMC , data = forest_train) %>% 
        step_scale(all_predictors()) %>%
        step_center(all_predictors())
    
    # Creating a cross validation set 
    ff_vfold <- vfold_cv(forest_train, v = 10, strata = Classes)
    
    # Candidate K values to evaluate
    gridvals <- tibble(neighbors = 1:15)
    
    # Fitting our model with different k values
    knn_results <- workflow() %>%
        add_recipe(ff_recipe) %>%
        add_model(knn_spec) %>%
        tune_grid(resamples = ff_vfold, grid = gridvals) %>%
        collect_metrics()
    
    # Selecting a model with the highest accuracy 
    best_k <- knn_results %>% 
        filter(.metric == "accuracy") %>% 
        arrange(desc(mean)) %>% 
        slice(1) %>% 
        select(neighbors) %>% 
        pull()
    
    best_k
    
    5

    Testing our model

    In [10]:
    # Setting seed value
    set.seed(2021)
    
    # Creating final model
    ff_knn <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) %>% 
        set_engine("kknn") %>% 
        set_mode("classification")
    
    # Fitting the model
    ff_fit <- workflow() %>% 
        add_recipe(ff_recipe) %>% 
        add_model(ff_knn) %>% 
        fit(data = forest_train)
    
    # Using our fit to predict on test set
    ff_predictions <- predict(ff_fit, forest_test) %>% 
        bind_cols(forest_test)
    
    # Estimating accuracy of our classifier
    ff_metrics <- ff_predictions %>% 
        metrics(truth = Classes, estimate = .pred_class)
    ff_metrics
    
    ff_conf_mat <- ff_predictions %>%
        conf_mat(truth = Classes, estimate = .pred_class)
    ff_conf_mat
    
    A tibble: 2 × 3
    .metric   .estimator  .estimate
    <chr>     <chr>       <dbl>
    accuracy  binary      0.9833333
    kap       binary      0.9662162
              Truth
    Prediction fire not fire
      fire       33        0
      not fire    1       26
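
    Accuracy alone can hide class-specific errors, which matters given the class imbalance noted earlier. Precision and recall for the fire class can be computed directly from the counts in the confusion matrix above (a quick check in base R):

    ```r
    # Counts read off the confusion matrix above, with "fire" as the positive class
    tp <- 33   # fires correctly predicted as fire
    fp <- 0    # non-fires incorrectly predicted as fire
    fn <- 1    # fires incorrectly predicted as not fire
    tn <- 26   # non-fires correctly predicted as not fire

    precision <- tp / (tp + fp)                   # no false alarms
    recall    <- tp / (tp + fn)                   # one fire was missed
    accuracy  <- (tp + tn) / (tp + fp + fn + tn)  # matches the 0.9833 above

    round(c(precision = precision, recall = recall, accuracy = accuracy), 4)
    ```

    yardstick's precision() and recall() functions would compute the same values directly from ff_predictions.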
    In [11]:
    # Creating a vector with 100 evenly spaced values spanning the ISI range
    isi_seq <- seq(from = min(algerian_forest_fires_tidy$ISI, na.rm = TRUE), 
                   to = max(algerian_forest_fires_tidy$ISI, na.rm = TRUE), 
                   length.out = 100)
    
    # Creating a vector with 100 evenly spaced values spanning the DMC range 
    dmc_seq <- seq(from = min(algerian_forest_fires_tidy$DMC, na.rm = TRUE), 
                   to = max(algerian_forest_fires_tidy$DMC, na.rm = TRUE), 
                   length.out = 100)
    
    # Creating a grid of all ISI/DMC combinations (100 x 100 = 10,000 points)
    grid_points <- expand.grid(ISI = isi_seq,
                               DMC = dmc_seq)
    
    # Predicting the class of every grid point to construct a decision boundary plot
    grid_predicted <- ff_fit %>% 
        predict(grid_points) %>% 
        bind_cols(grid_points)
    
    In [12]:
    # Creating a decision boundary plot 
    dmc_isi_boundary_plot <- ggplot(grid_predicted, aes(x = DMC, y = ISI, color = .pred_class)) +
        geom_point(alpha = 0.3, size = 5) + 
        geom_point(data = algerian_forest_fires_tidy, aes(x = DMC, y = ISI, color = Classes)) +
        labs(y = "Initial Spread Index", x = "Duff Moisture Code", color = "Class label") +
        theme(text = element_text(size = 18),
              axis.text = element_text(size = 10)) 
    
    dmc_isi_boundary_plot
    

    Discussion

    Results:

  • Overall, we were able to develop a K-nearest neighbors classifier that could help predict future forest fires using other available indicators. The classifier turned out to have 98% accuracy, which could prove very useful in the detection and prevention of future forest fires.
  • Did we expect this outcome?

  • We did expect the classifier to be reasonably accurate, because we knew the variables in our data would have some correlation with fire occurrence, but achieving such a high level of accuracy is something we didn't foresee. Looking back at this project in hindsight, we think a major part of why the classifier achieved this number is the variables we took into account. Allowing more variables in our classifier could let the algorithm distinguish classes at an even finer level, possibly yielding even greater accuracy.
  • Significance:

  • Being able to predict future forest fires using these variables could be very useful, as it could not only aid in the prevention of these fires but also, on a macro level, contribute to environmental sustainability. In the US, the area burned by forest fires has doubled since 1985 to cover over 10 million acres of land, greatly accelerating the process of climate change. Utilising this algorithm to prevent even 10% of these forest fires could have a huge impact on helping sustain our environment. Furthermore, global oxygen levels, the lives of endangered species, and even global warming could all see huge improvements from the ability to predict and prevent future forest fires. We think our algorithm has great utility in this area, and that this is an aspect where the utilisation of data could really contribute.
  • Further discussion:

  • In the future, we could look into how our model would perform in different regions worldwide. The predictors we used to create our model are amongst the most commonly collected types of weather observations worldwide, so scalability should not be an issue. However, our model's relevance could be diminished by the small number of observations and the fact that they were all collected in the same region. Thus, our model is susceptible to small anomalies such as anthropogenic factors, and may best be suited to predicting fires in regions with a moderate climate.